### Make sure all the packages are installed
if (!requireNamespace("Seurat", quietly = TRUE))
install.packages("Seurat")
if (!requireNamespace("tidyverse", quietly = TRUE))
install.packages("tidyverse")
if (!requireNamespace("devtools", quietly = TRUE))
install.packages("devtools")
if (!requireNamespace("colorBlindness", quietly = TRUE))
install.packages("colorBlindness")
if (!requireNamespace("DT", quietly = TRUE))
install.packages("DT")
if (!requireNamespace("scales", quietly = TRUE))
install.packages("scales")
if (!requireNamespace("tictoc", quietly = TRUE))
install.packages("tictoc")
if (!requireNamespace("ggalluvial", quietly = TRUE))
install.packages("ggalluvial")
### Load all the necessary libraries
library(Seurat)
library(tidyverse)
library(devtools)
library(colorBlindness)
library(DT)
library(scales)
library(RColorBrewer)
library(scales)
library(tictoc)
library(ggalluvial)
set.seed(687)8 - Subclustering
Introduction
In this vignette we will examine methods for increasing resolution on cell subtypes and cell states. We will compare two methods: increasing resolution and other parameters to find more clusters, and subclustering. Subclustering is the process of clustering, subsetting to one cluster, then running the clustering pipeline again. In high-dimensional datasets, especially ones with lots of technical or biological noise, focusing on specific celltypes individually improves detection of subtype and state patterns. Highly variable gene selection and latent-space calculation are both affected by noise and outliers. Subclustering can also improve computational efficiency - finding small clusters can be expensive if working with the full dataset.
However, it’s important to keep in mind that iterative subclustering can lead to “overfitting” the data. This means we might identify noise as clusters, and we will have to contend more with the “curse of dimensionality” in downstream analysis. We should always validate our clusters according to expression of marker genes, use technical replicates, bootstrapping methods, or check their existence in external datasets.
Vocabulary
Subclustering
The process of dividing a previously identified cluster into smaller, more detailed clusters (subclusters). Subclustering is used to uncover finer, often subtle distinctions within a dataset that might not be visible in an initial analysis.
Overfitting
Overfitting is when an analyst describes random noise in the data rather than underlying general relationships. Overfit models perform well on their dataset, but very poorly on other data.
Curse of Dimensionality
The term data analysts use to describe phenomena that appear when analyzing data in high-dimensional spaces. As dimensionality increases, the data can become sparse. Sparsity is problematic for statistical significance testing. Additionally, by increasing dimensions, we increase the number of false positives when using p-value thresholds.
Parameter scan
AKA parameter sweep, this is the process of systematically varying parameters in an algorithm to analyze the effects of their changes on the outcome. Parameter scans are widely used in computationaly biology to identify optimal parameters or test the stability of our models.
Key Takeaways
Recomputing highly variable genes at each subclustering step resets the biological universe we are looking at to the capture the “new” sources of variability.
Iterative subclustering is essential to uncover fine grained populations
In addition to finding fine-grained populations, subclustering can help create better divisions between the different “species” of celltypes and subtypes.
Libraries
Load data
We’re going to be working with a dataset from the paper - Immunophenotyping of COVID-19 and influenza highlights the role of type I interferons in development of severe COVID-19 Download data from the cellxgene portal.
se <- readRDS("../data/se_lvl1.rds")Subset T cell compartment for downstream analysis
tse <- se[, se$annotation_lvl1 == "T cells"]Color palette
# Set color palette for cell types
pal <- paletteMartin
names(pal) <- sort(unique(se$Celltype))
donor_pal <- c(
"#66C2A4", "#41AE76", "#238B45", "#006D2C",
"#41B6C4", "#1D91C0", "#225EA8", "#253494", "#081D58",
"#FFEDA0", "#FED976", "#FEB24C", "#f79736", "#FD8D3C",
"#FC4E2A", "#E31A1C", "#BD0026", "#ad393b", "#800000", "#800050")
names(donor_pal) <- c(
"Normal 1", "Normal 2", "Normal 3", "Normal 4",
"Flu 1", "Flu 2", "Flu 3", "Flu 4", "Flu 5",
"nCoV 1", "nCoV 2", "nCoV 3", "nCoV 4", "nCoV 5",
"nCoV 6", "nCoV 7", "nCoV 8", "nCoV 9", "nCoV 10", "nCoV 11"
)To get up to speed with the previous worksheets, process the data in the same way.
se <- se %>%
NormalizeData(verbose = FALSE) %>%
FindVariableFeatures(
method = "vst",
nfeatures = 3000,
verbose = FALSE) %>%
ScaleData(verbose = FALSE, features = VariableFeatures(.))Let’s extract the top 3000 HVGs from the whole data across all cell types.
hvg_full <- VariableFeatures(se)Analysis
Finding rare celltypes
For many of us, our first idea for finding rare celltypes is to modulate the parameters of what we have already to find the right celltypes. As we saw in the previous clustering notebook, we can find NK and T cell clusters this way, but it still seems like there’s some heterogeneity in the clusters we’ve found. For illustration, I’ve run a parameter scan on the following variables:
FindVariableFeaturesnfeaturesRunPCAnpcs,FindNeighborsk.param,FindClustersresolution
parameter_df <- expand.grid(
nf = c(2000, 3000, 5000),
pc = c(20, 30, 50),
k = c(10, 30, 50),
res = c(0.5, 0.8, 1.2)
)
seurat_parameter_scan <- function(srobj, nf, pc, k, res) {
srobj <- srobj %>%
NormalizeData() %>%
FindVariableFeatures(selection.method = "vst", nfeatures = nf) %>%
ScaleData(features = head(VariableFeatures(object = .), nf)) %>%
RunPCA(npcs = pc, verbose = FALSE) %>%
FindNeighbors(dims = 1:pc, k.param = k) %>%
FindClusters(resolution = res, verbose = FALSE)
# Extract Idents and name the output vector
idents <- Idents(srobj)
names(idents) <- paste("nf", nf, "pc", pc, "k", k, "res", res, sep="_")
return(idents)
}
# Apply the function to each row of parameter_df and combine results into a single data frame
paramscan <- lapply(seq_len(nrow(parameter_df)), function(i) {
params <- parameter_df[i, ]
idents_vector <- seurat_parameter_scan(se, params$nf, params$pc, params$k, params$res)
return(idents_vector)
})
# name cluster columns after parameters used to obtain them
names(paramscan) <- parameter_df %>%
mutate(params = pmap_chr(., function(...) {
cols <- colnames(parameter_df)
values <- list(...)
paste0(cols, values, collapse = "_")
})) %>%
pull(params)
saveRDS(paramscan, '../data/covid_flu_srobj_clusters_paramscan.rds')paramscan <- readRDS('../data/covid_flu_srobj_clusters_paramscan.rds')
paramscan <- bind_cols(paramscan)
row.names(paramscan) <- colnames(se)
paramscan_long <- paramscan %>%
rownames_to_column(var = "cell_id") %>%
pivot_longer(
cols = -cell_id,
names_to = "parameter",
values_to = "cluster"
) %>%
mutate(
nf = as.numeric(gsub(".*nf(\\d+)_.*", "\\1", parameter)),
pc = as.numeric(gsub(".*pc(\\d+)_.*", "\\1", parameter)),
k = as.numeric(gsub(".*k(\\d+)_.*", "\\1", parameter)),
res = as.numeric(gsub(".*res(\\d+\\.\\d+).*", "\\1", parameter))
) %>%
group_by(parameter) %>%
mutate(
n_clusts = max(as.numeric(cluster))
) %>%
select(-parameter)
ggplot(paramscan_long %>%
select(-cell_id, -cluster) %>%
distinct, aes(x = factor(pc),
y = n_clusts,
fill = factor(k))) +
geom_bar(stat='identity') +
facet_grid(paste('k=',k) + paste('nf =',nf) ~ paste('resolution =',res)) +
scale_y_continuous(labels = scales::label_number()) +
scale_fill_manual(values = unname(donor_pal[c(1,6,11)])) +
labs(
title = "Number of clusters Across Parameters",
x = "Clustering resolution + nPCs",
y = "Number of clusters",
fill = "k param") +
theme_minimal(base_size = 12) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) Take a close look at the results. When we modulate nfeatures or nPCs, we don’t directly see changes in the number of celltypes found. But, like we saw in the previous sessions, modulating the k.param and the clustering resolution have outsized effects on the number of clusters found.
But if we’re looking for a rare celltype, we must include more information, right? It makes most sense to increase both the nfeatures, nPCs, AND clustering resolution to find that celltype - because we need to make sure we are including the genes that define the celltype, and making small enough clusters to be able to find it.
Comparing subclustering - using full dataset HVGs within only the subset
To do this we will focus on the T cell compartment.
First let’s process our T cell subset object with the HVG obtained from the whole dataset - hvg_full:
tse_full <- tse %>%
NormalizeData() %>%
ScaleData(
verbose = FALSE,
features = hvg_full) %>%
RunPCA(
features = hvg_full,
npcs = 20,
verbose = FALSE) %>%
FindNeighbors() %>%
FindClusters(resolution = c(0.05, 0.1, 0.15, 0.2), verbose = FALSE)
# Visualize these clusters
DimPlot(
tse_full,
reduction = 'umap',
group.by = c("RNA_snn_res.0.05", "RNA_snn_res.0.1",
"RNA_snn_res.0.15", "RNA_snn_res.0.2"))dim_full <- DimPlot(
tse_full,
reduction = 'umap',
group.by = "RNA_snn_res.0.15")Now let’s recompute the HVG for the T cell subset to capture the variability within that subset:
tse_sub <- tse %>%
FindVariableFeatures(
method = "vst",
nfeatures = 3000,
verbose = FALSE) %>%
ScaleData(verbose = FALSE, features = VariableFeatures(.)) %>%
RunPCA(
npcs = 20,
verbose = FALSE
) %>%
FindNeighbors() %>%
FindClusters(resolution = c(0.05, 0.1, 0.15, 0.2), verbose = FALSE) %>%
RunUMAP(dims = 1:20, verbose = FALSE)
# Visualize these clusters
DimPlot(
tse_sub,
reduction = 'umap',
group.by = c("RNA_snn_res.0.05", "RNA_snn_res.0.1",
"RNA_snn_res.0.15", "RNA_snn_res.0.2"))dim_sub <- DimPlot(
tse_sub,
reduction = 'umap',
group.by = "RNA_snn_res.0.15")Right off the bat, how do these UMAPs compare to each other
dim_full + dim_subWhat you’re probably wondering first is, hey would I have found these clusters if I had just increased the number of clusters I used?
data_alluvial <- data.frame(
bc = colnames(tse_full),
full_hvg = tse_full$RNA_snn_res.0.15,
sub_hvg = tse_sub[, colnames(tse_full)]$RNA_snn_res.0.15) %>%
dplyr::count(full_hvg, sub_hvg)
ggplot(data = data_alluvial,
aes(axis1 = full_hvg, axis2 = sub_hvg, y = n)) +
geom_alluvium(aes(fill = full_hvg), width = 0.1) +
geom_stratum(width = 0.1) +
geom_text(stat = "stratum", aes(label = after_stat(stratum))) +
theme_minimal() +
labs(title = "Alluvial plot of Clustering using full data HVG or subcluster HVG",
x = "Cluster assignment",
y = "N")Both clustering resolutions seem to be very similar but using specific HVG we are able to detect one more cluster at the same resolution and, in theory, the clusters should be more specific.
If we check for example the overlap of highly variable features between the two methods, we’ll find there are very few overlapping
table(VariableFeatures(tse_sub) %in% hvg_full)
FALSE TRUE
1064 1936
head(setdiff(VariableFeatures(tse_sub), hvg_full), n = 50) [1] "BTK" "PTMA" "UQCRH" "CLIC4" "MBOAT2" "COL9A3" "MSRB3" "MAP7" "JUP" "TIMM13" "NDUFS6" "SLC25A3" "SLC45A3" "MTDH" "MRPL52" "FKBP1A" "PSMB3" "HNRNPD" "ZNF706" "ATP5MF" "YBX3" "XRCC5" "HDGF" "ACTR3" "SAPCD2" "ATP5PF" "COX7B" "ANP32A" "ATP5F1C" "CHCHD2" "SRSF3" "VCP" "SSR1" "LCA5" "SMS" "ERH" "CFL1" "RPLP0" "UQCRC1" "CALM3" "TIMM8B" "NDUFAB1" "PSMA7" "COX8A" "NCOA3" "RP11-281P23.1" "EIF5A" "UQCRQ" "GAS2L3" "PSMB8"
intersect(VariableFeatures(tse_sub), hvg_full) [1] "JCHAIN" "IGKC" "MZB1" "IGHG4" "HBB" "IGHG1" "IGHG3" "HBA2" "IGLC2" "IGHA1" "H4C3" "HBA1" "ITM2C" "UBE2C" "IGLC3" "TUBA1B" "IGHA2" "TNFRSF17" "HSP90B1" "IGHG2" "CD79A" "STMN1" "DERL3" "RRM2" "TYMS" "TOP2A" "MKI67" "IGHM" "PCLAF" "PPBP" "TUBB" "POU2AF1" "SEC11C" "HSPA1A" "PTGDS" "CCL3" "MYBL2" "LYZ" "PCNA" "FABP5" "BIRC5" "S100A8" "IGLL5" "S100A9" "CCL4L2" "NUSAP1" "HMGB2"
[48] "IFI27" "CENPF" "CCNB1" "FAM30A" "H1-5" "IGLV3-1" "CCNB2" "CCL20" "PLK1" "CDC20" "IGLV6-57" "H2AZ1" "IGHV6-1" "IGKV4-1" "SSR3" "SOX4" "TK1" "DUT" "PRDX4" "CCL4" "HSPD1" "IFIT2" "IGKV3-15" "PF4" "ASPM" "UCHL1" "MYDGF" "CKS2" "EAF2" "IGKV3-20" "S100B" "TNFRSF13B" "IGHV4-39" "IGKV3-11" "CDK1" "CDKN3" "MCM7" "IGHV3-7" "IGHGP" "ZWINT" "AQP3" "TUBA1C" "HMGA1" "IGLV3-19" "PRDX1" "GAPDH" "COTL1"
[95] "PPIB" "MT1G" "UBE2J1" "COBLL1" "IGHV3-72" "ALAS2" "CKS1B" "CENPM" "CDT1" "HSPA6" "CCNA2" "TPX2" "IGKV1-39" "CST3" "BASP1" "GINS2" "CLSPN" "CA1" "SMIM24" "SMC4" "RNASE6" "S100A12" "SDC1" "HMMR" "CCND2" "LINC01781" "PDIA4" "PTTG1" "FKBP11" "TUBB4B" "CD38" "SPINK2" "LILRB4" "MEF2C" "CSF2" "CYTL1" "RP11-16E12.2" "PDIA6" "MCM5" "TNF" "KPNA2" "MCM4" "CENPW" "DLGAP5" "SUB1" "RP11-1070N10.3" "IGLV2-23"
[142] "NRGN" "PHGDH" "AREG" "CAV1" "XCL1" "HSPA1B" "PTP4A3" "CD27" "SDF2L1" "FCER1A" "HMGN2" "H1-3" "GNG4" "NT5DC2" "UBE2S" "NASP" "VCAN" "AURKB" "NME1" "IGLV4-69" "RANBP1" "IGHV1-46" "CENPU" "MCM3" "JPT1" "LMAN1" "DHFR" "RP11-1143G9.4" "IGKV1-5" "SRM" "SMC2" "IGLVI-70" "ACTB" "AVP" "GAS6" "CPA3" "FOS" "MAD2L1" "ACTG1" "KLHL14" "HLA-DRA" "IGLV1-51" "SPC25" "IGLV1-40" "CDCA5" "HES4" "H1-2"
[189] "GMNN" "IGHV3-64D" "CMTM5" "TRBV14" "ERG" "CENPA" "SHCBP1" "CPNE5" "IGHV1-18" "H2AC16" "RP11-1I2.1" "PRDX3" "HMGB1" "KIFC1" "FCRL5" "H2AC12" "CDCA8" "IGHV3-53" "PDLIM1" "SLC25A5" "SLC2A5" "HMGB3" "SKA3" "TCF4" "GTSE1" "ENO1" "PAICS" "BANK1" "H2AX" "CDO1" "GGH" "RAN" "CTSH" "KIF11" "TXN" "MYCN" "RPL22L1" "IGHV3-30" "H2AC20" "RPA3" "CENPE" "CDCA7" "SPCS3" "PKMYT1" "AHSP" "CDC6" "CD34"
[236] "RGS18" "GPRC5D" "EIF4A3" "GP9" "CDCA2" "KIF2C" "IFIT3" "HJURP" "GNLY" "DUSP4" "MARCKS" "HELLS" "NCAPG" "IGHV5-51" "AIF1" "CNRIP1" "H3C8" "EGR1" "KIF23" "CDK6-AS1" "MANF" "PHF19" "UBE2T" "HSP90AA1" "HBG2" "EGFL7" "CCT5" "FBXO5" "BEX1" "WARS1" "SH2B2" "PPA1" "CENPN" "FCGR2B" "HBD" "CCR10" "RHOB" "IGF1" "RAD51AP1" "GATA2" "CDCA3" "ATP5F1B" "IGHV1-24" "LDHA" "CA2" "RNASE2" "XBP1"
[283] "SNCA" "TPI1" "IER3" "CD9" "CDC45" "IGLV4-60" "CACYBP" "IGFL2" "GZMK" "PMCH" "HLA-DRB5" "BUB1" "IGHV3-66" "IGHV3-48" "DNAJC9" "PGAM1" "GNG7" "TMPO" "ATP5MC1" "ASF1B" "IGKV2-30" "FKBP2" "TXNDC5" "RRM1" "PLD4" "IGHV2-5" "CALR" "IFNG" "EREG" "MS4A1" "TIMD4" "MCM2" "ATP5MC3" "CLPTM1L" "ANXA2" "ANXA5" "TROAP" "IGHV1-69D" "PRSS57" "TRBV7-2" "IGHD" "NUF2" "CAVIN1" "IGHV3-23" "RGS1" "MRPL51" "PNOC"
[330] "H1-0" "PKM" "UHRF1" "IGKV1-17" "TUBB1" "CTSZ" "CD79B" "NPR3" "LMO2" "PTCRA" "C1QBP" "FCRLA" "ABCB9" "PPP1R14A" "YWHAE" "KLRC1" "CPVL" "TXNDC17" "AHCY" "HNRNPAB" "PLPP5" "HLA-DQA1" "TFPI" "TRPM4" "ODC1" "CEP55" "LTB" "PBK" "MT1E" "TRIB1" "E2F1" "HSPE1" "CAVIN2" "RPN2" "TKTL1" "MYL9" "CKAP2L" "HBM" "FAM111B" "SNRNP25" "DCTPP1" "MT2A" "CRELD2" "DNAAF1" "ENTPD1" "HES1" "PKHD1L1"
[377] "PPP1R14B" "ANP32E" "DTYMK" "KLF1" "RRBP1" "PROM1" "LINC02573" "NDC80" "ID1" "ENHO" "NPW" "BLNK" "SPC24" "CHEK1" "MAFB" "RASSF6" "CLEC10A" "GNG11" "BCL11A" "C1QTNF4" "MYOM2" "NCAPH" "KCNN3" "PSMB2" "FCN1" "TM4SF1" "CD1C" "BHLHE41" "LMNB1" "AURKA" "HSP90AB1" "TACC3" "GATA1" "PSMA4" "RGCC" "SLC4A1" "SSR4" "ACRBP" "EBNA1BP2" "CCL3L1" "LINC01480" "ORC6" "KNL1" "MAN1A1" "APOBEC3B" "CD180" "KDELR2"
[424] "P2RX5" "P4HB" "E2F2" "SPIB" "ALYREF" "SEC61B" "RP11-354E11.2" "FEN1" "H3C7" "SGO2" "IGHV4-34" "JUN" "XCL2" "IMPDH2" "PPIA" "RNU12-ENSG00000270022" "ESCO2" "SPI1" "NXF3" "ERLEC1" "DYNC2I2" "MIR23AHG" "IGLL1" "DDX39A" "ANLN" "LRRC59" "FOXM1" "CDKN1A" "TCL1A" "CXXC5" "HSPA5" "LAPTM4B" "DEK" "CHCHD10" "CH25H" "CKAP4" "HLA-DQB1" "C2orf88" "MS4A6A" "SRPRB" "IFIT1" "LGALS1" "SSRP1" "ECT2" "MNDA" "SLBP" "DPEP1"
[471] "TMEM106C" "IGKV3D-15" "IGHV4-4" "VDAC1" "SEC61G" "CXCR3" "HM13" "FCRL2" "OSTC" "CD74" "ANP32B" "DTL" "SAE1" "SEMA4A" "IGLV3-9" "KIF14" "GBP1" "GGCT" "ATP1B3" "YBX1" "SNRPE" "PSME2" "MIF" "ISOC2" "BIK" "TALDO1" "PLS3" "MCM6" "LSM4" "NLRP7" "NPM1" "KIF22" "IDH2" "ACTN1" "NDUFB3" "NANS" "PA2G4" "NCF1" "CORO1A" "H1-4" "MX1" "ZNF385D" "SGO1" "IGLV3-21" "IL10" "APOC1" "TXNDC11"
[518] "DIAPH3" "IGKV1-9" "MACROH2A1" "HLA-DRB1" "MRPL12" "KIF15" "NUDT1" "GMPPB" "CD19" "TUBG1" "LAMP5" "VIM" "CLEC1B" "TNFRSF4" "IRF4" "CAPG" "PIM2" "FANCI" "PYCR1" "LRR1" "CRYGD" "DTNB-AS1" "SKA1" "PMAIP1" "IGLC5" "DIPK1B" "PARM1" "MPIG6B" "ESAM" "ALDH2" "CCT8" "BUB1B" "PSMA5" "TMEM40" "THBS1" "SGK1" "EPCAM" "COX5A" "GIHCG" "DEPDC1B" "LIG1" "RGS16" "H2AZ2" "HP" "HSPH1" "PARP1" "SELL"
[565] "TCP1" "CCDC167" "TRAP1" "RHEX" "H2AC6" "CRHBP" "LGALS3" "TMEM258" "C3orf14" "TRAC" "CADM1" "MCM10" "CBX3" "SKA2" "AIRE" "ZBTB32" "MXD3" "GZMA" "HLA-DQA2" "H2BC7" "ACTG2" "SNRPG" "SPCS2" "IGLV2-11" "CARHSP1" "MLEC" "H2BC9" "PKLR" "RTKN2" "TOR3A" "PFN1" "IGHV3-43" "JSRP1" "RETN" "H2BC8" "DYNLL1" "SLC43A3" "IGKV1D-8" "RFC4" "CHAC1" "HBG1" "CYBB" "DENND5B" "IFI30" "SPTA1" "CD8A" "MYL4"
[612] "SERPINA1" "CBX5" "IGLC7" "CYC1" "FCER1G" "SHMT2" "EMID1" "TRAV1-1" "IGLV3-25" "IGLV2-14" "LINC02446" "PDCD5" "CCR7" "TFDP1" "DEPDC1" "IGHV4-31" "MTHFD1" "LST1" "TTK" "ISG15" "PRDX2" "HNRNPA2B1" "IGHV3-21" "BLK" "POLQ" "CD8B" "LMAN2" "STOML2" "POPDC3" "CXCR6" "TMED9" "LIMS1" "CCT7" "PRMT1" "TREML1" "CD70" "PPIF" "CDCA4" "H2AC14" "PRC1" "MYC" "MGLL" "DCPS" "CDK4" "ASNS" "CIP2A" "SAC3D1"
[659] "FH" "SLC38A5" "YWHAH" "IGLV2-8" "DNMT1" "LMNA" "LYL1" "TRGC1" "FOSB" "IGLV1-47" "NEK2" "CCDC50" "HTR1F" "ARID3A" "FTL" "LTBP1" "ATP5F1A" "KIF4A" "ARHGAP11A" "GADD45A" "CTNNAL1" "EIF1AY" "GUCY1B1" "MT-ND6" "ATAD2" "ACY3" "IGKV1D-39" "IGLV8-61" "CCT6A" "E2F8" "EZH2" "IGLV5-48" "CLDN3" "RP11-301G19.1" "PHB" "RACGAP1" "ZNF521" "RBBP8" "KLRC2" "HPGD" "TSC22D1" "RASGRP3" "IGHV4-28" "H1-10" "ITGA2B" "EVA1B" "BANF1"
[706] "TKT" "MND1" "SLIRP" "MTFR2" "KCNH2" "EIF2S2" "SPAG5" "ARF4" "DNAJB11" "IGFBP7" "IFI44L" "DDOST" "PRR11" "TRGC2" "LDHB" "PDXK" "CCT2" "BRCA1" "CD14" "BCL2L12" "PIF1" "PRKAR2B" "NR4A1" "CTLA4" "NEXN" "IGHV3-33" "ACOT7" "OAS1" "CENPH" "SPATS2" "ANGPT1" "USP1" "CLU" "THBD" "ALDH1A1" "C11orf96" "TPM4" "MYB" "CHPF" "RBM24" "BLVRB" "PPM1G" "CXCL13" "TRIP13" "CHAF1A" "IGKV2-24" "FKBP4"
[753] "CSTA" "TPD52" "IER2" "OSTN-AS1" "CEBPD" "SERPINH1" "LGALS2" "PLAAT2" "ATAD5" "KIF20A" "SMC1A" "MTCH2" "CCNF" "TIMM10" "TRBV13" "SMC3" "ICOS" "MRPL37" "IGHV2-70" "SNRPF" "OTUD1" "TMPRSS11E" "HSPB11" "ITGA2" "ELL2" "TXNDC15" "SLC25A4" "SELENOS" "H4C8" "ID3" "LAIR2" "ATF5" "GLDC" "PPP1R2C" "CNN2" "TAMALIN" "HLA-DMA" "AXL" "HDLBP" "DUSP5" "F2RL3" "SHMT1" "CCDC34" "DUSP6" "IFITM3" "TSPAN13" "NPM3"
[800] "E2F7" "SERPINB1" "RPS4Y1" "PDGFA" "TMEM156" "IL21-AS1" "MELK" "H1-1" "HSPA8" "ENSG00000277856.1" "IGHV3-20" "CIBAR2" "TYMP" "MAP1B" "GZMB" "CCR6" "PASK" "IL1B" "POLR3K" "RALGPS2" "DUSP2" "CORO1B" "CANX" "CDC25B" "CCT3" "PRKDC" "MME" "NFKBIA" "ADGRG6" "UNC13C" "CCDC88A" "PRELID1" "CLC" "SPON2" "ANXA3" "NCAPD2" "POC1A" "TIMP3" "ORC1" "BZW2" "ARL6IP1" "FXYD2" "NPIPA2" "CPEB4" "COCH" "CCNE1" "YEATS4"
[847] "TCF19" "AC004540.4" "WDR76" "TGFB1I1" "FAM178B" "OIP5" "ETV1" "CALU" "SPDL1" "TSPAN12" "VSIG4" "CCR3" "PMVK" "FOXP3" "SEC61A1" "SUPT16H" "FKBP3" "TLR10" "CD59" "ELFN1-AS1" "IL6R" "RAB13" "PSMG1" "MGST1" "RAB32" "LAP3" "ZNF683" "MIR4435-2HG" "NUDT5" "G0S2" "MRPS34" "LSM5" "GP1BA" "CDCA7L" "H2AC11" "POLD2" "DPPA4" "TRIM55" "PTPRD" "BATF" "RP5-1028K7.2" "NEGR1" "PSMD1" "PAQR4" "FUT7" "SNRPB" "IGHV3-11"
[894] "CISD2" "SSPN" "VWDE" "ST6GALNAC4" "GRN" "TNFRSF18" "NEFL" "HSD17B10" "GNA15" "NFKBID" "CES1" "PSMA3" "CD69" "ALG5" "RCC1" "EIF2S1" "B9D1" "ZNF215" "TRIM28" "FAM3C" "MT1H" "ZG16B" "ALDH1L2" "BRCA2" "VCAM1" "LY6G6F" "KCNMA1" "RFC5" "VRK1" "BARX1" "RFC2" "CPXM1" "PSAT1" "NOP56" "HADH" "MTHFD2" "EGR3" "PCDH9" "TRBV28" "SEL1L3" "TRAV20" "TIMELESS" "KIF20B" "CLEC11A" "CDK6" "ENSG00000276345.1" "CRIP1"
[941] "COPS3" "SESN3" "SYK" "CTB-50L17.16" "MRC1" "COPB2" "TRDC" "ATIC" "IFI27L1" "COL6A2" "KCNQ1OT1" "CALD1" "MIR193BHG" "NPTX2" "NUCKS1" "TP53INP1" "CENPX" "CISD1" "VOPP1" "RUVBL2" "ASS1" "TAL1" "HIRIP3" "LGALSL" "GCSH" "MPP1" "CLECL1" "KCTD12" "RP11-160E2.6" "NPDC1" "TSHZ2" "CLDND1" "FPR1" "MYLK" "CFD" "RPPH1-ENSG00000259001" "ADAM19" "MRPL13" "CKAP2" "INCENP" "ITGB3BP" "LAG3" "TIMP2" "NCAPG2" "TPSB2" "LXN" "BOLA3"
[988] "HPRT1" "MMRN1" "GSN" "RHAG" "TMEM176B" "MS4A2" "GSTM3" "FAM43A" "NUCB2" "MCUR1" "CXCL8" "VPREB3" "ENSG00000275063.1" "RASD1" "YWHAQ" "STIL" "TNFSF9" "NCAPD3" "RUVBL1" "SLC1A5" "UQCC2" "SH2D1A" "HOXA5" "IFI6" "HPGDS" "VDAC3" "FANK1" "ATAD3A" "SPINT2" "LSM2" "ACAT2" "TEX9" "GSTO1" "ENAM" "RGS13" "TRBV23-1" "TRBV29-1" "H2BC4" "KLRB1" "RMI2" "EXO1" "IGHV7-4-1" "PXMP2" "KEL" "TRAV40" "F11-ENSG00000088926" "PRTG"
[1035] "SHC4" "LINC01623" "TRAV29DV5" "SEPHS1" "AHSA1" "PTMS" "NFE2" "LGMN" "PRPF19" "TRBV20-1" "ACTL6A" "RNF130" "MED12L" "MYL6B" "STT3A" "B3GNT7" "RPS27L" "SLC44A1" "CHI3L2" "STAT1" "AC104024.1" "SELENOP" "TCEAL9" "MS4A4A" "RP11-1399P15.1" "KRT86" "IL22" "AJ006998.2" "HPSE" "MAL" "SNRPD1" "ISG20" "CTSG" "CRTAM" "CCNA1" "BARD1" "LINC01485" "MSH6" "H2BC11" "NOSIP" "STIP1" "HAUS1" "TOMM40" "CRIP2" "IGHV1-69" "IGFBP4" "NEFM"
[1082] "SMOX" "LINC01943" "ZNF704" "TRBV21-1" "IL2RA" "DDX11" "SPARC" "GPR183" "BCAT1" "TOX2" "MRPL4" "RP11-386I14.4" "LRRC25" "TRAV38-2DV8" "POLR2H" "H2BU1" "IL17RB" "MMP2" "AC015969.3" "HOXC9" "KCNK15-AS1" "HAT1" "GYPA" "CH17-373J23.1" "RP1-34B20.21" "CMSS1" "NCF2" "TSPYL2" "FAM136A" "SLC27A2" "MDH1" "ATF3" "RGS2" "MET" "OASL" "EPRS1" "GALNT14" "CD4" "PERP" "TMEM14C" "HPDL" "MRTO4" "C19orf48" "OXCT2" "CDK2AP2" "EGR4" "GDF10"
[1129] "UNC5B-AS1" "NUP37" "IRF8" "CEP152" "ICAM1" "PHACTR1" "TIPIN" "KIR2DL3" "SOCAR" "RAD51C" "CDKN2A" "S100A11" "CHN1" "GOT2" "LY86" "KCNK17" "CYCS" "LINC00582" "ZNF331" "MAGOHB" "CENPP" "IGLV3-27" "ST6GAL2" "RP11-480C16.1" "HLA-DPA1" "ZFP36" "GNL3" "CD36" "AKAP12" "EEF1E1" "IMPA2" "SCN3B" "CHST2" "CCL23" "PRIM1" "TRAV30" "ROR2" "TGFBI" "TMEM45A" "AKR1C3" "TAP1" "PDCD1" "TIGIT" "MRPL22" "TNFRSF9" "CD86" "CRYBA4"
[1176] "CHST15" "H2BC3" "PARVB" "GLRX5" "MT1X" "HMOX1" "RP11-251M1.1" "KCNIP4-IT1" "KIR3DL2" "CENPS" "PSMD14" "THOC3" "VEGFA" "IL2" "SOCS3" "CCNE2" "MRPS6" "H4C15" "CD320" "POLA2" "FOXC1" "PKIB" "FDPS" "CAMK1" "IGLV1-44" "CLEC9A" "TRAV38-1" "SPTSSB" "DKC1" "KIR2DL1" "FBXO16" "CDC25A" "NSD2" "SNAI1" "CD160" "MTHFD1L" "SYT1" "PGAP4" "CEND1" "IL18" "KDM6B" "OXCT1" "RPA1" "CFAP43" "F13A1" "NAP1L1" "HMGN1"
[1223] "DUSP1" "AP1S2" "CIT" "LMNB2" "NR4A2" "LIMCH1" "RAPGEF4" "EPB41L4B" "PCDH17" "KIAA0087" "INHBA" "KIAA1549" "ASCL1" "MTUS2-AS1" "RP11-538D16.3" "KIF4B" "TP73-AS3" "RP11-469A15.2" "DEPDC1-AS1" "RP11-298E9.5" "RP11-569G13.2" "ARNTL2-AS1" "LINC01511" "LINC02345" "RP4-555D20.4" "RP1-136B1.1" "U91328.22" "RP11-2H3.6" "BCL7A" "RP11-879F14.2" "RUBCNL" "RBBP7" "NEFH" "DSCC1" "EBP" "FCER2" "TRBC1" "DLGAP1" "BEX3" "MESD" "RFC3" "MS4A3" "INKA1" "PARPBP" "CLIC3" "ACKR3" "HLA-DMB"
[1270] "MEG3" "CCR9" "SIRPG" "CDC25C" "PLK4" "SPATS2L" "HCK" "NMU" "KIR2DL4" "ACOXL" "GSTP1" "CLDN5" "IL32" "NRARP" "CHAC2" "HYOU1" "HACD1" "MPEG1" "RAMP1" "GMPS" "HCAR3" "SCN9A" "PLAUR" "CCDC168" "TTN" "NELL2" "RP11-902B17.1" "KLK1" "TRBV18" "UNG" "TNFAIP2" "GATA3-AS1" "CX3CR1" "SPACA3" "EPB42" "AK8" "LAYN" "LMO7-AS1" "PGD" "KB-1980E6.3" "CCDC28B" "IFNGR2" "RBM47" "MAP3K8" "TNFSF10" "IGHV1-69-2" "PLAU"
[1317] "MYADM" "TPM1" "KRT81" "SYCE1" "SLC40A1" "LINC02362" "GTSF1" "IFNG-AS1" "PLXDC2" "TCHH" "CALN1" "FANCA" "IGHV4-61" "LINC01250" "ANKRD28" "PCSK1N" "SMIM1" "FANCD2" "IER5L" "AKR1B1" "OGN" "MAF" "MIXL1" "RP11-279O9.4" "MTRNR2L1" "NUPR2" "SAMD1" "MAFF" "GAL" "CENPK" "ANPEP" "TIFA" "CD68" "GFI1B" "XAF1" "SCAMP5" "ITGAX" "POLE2" "EPSTI1" "SLC1A4" "TRH" "RP13-192B19.2" "CTD-2574D22.7" "PRSS2" "CD83" "TPO" "KRT8"
[1364] "DHRS9" "DOK5" "CHRM3" "C15orf48" "LTK" "MMS22L" "CHAF1B" "OR2B6" "TRBV3-1" "DONSON" "PIMREG" "PAX5" "WNT4" "TNFSF13B" "ITGA1" "SLC7A5" "RSAD2" "CEP128" "APOBEC3H" "CENPV" "PF4V1" "LCN2" "H2AC17" "DAAM1" "PRF1" "H3C10" "SIT1" "PLEK" "CLMN" "MYCT1" "RPL39L" "DNTT" "CAP2" "CDH9" "COLEC11" "TCN1" "HMCN1" "CCDC39-ENSG00000145075" "SLC26A7" "P2RY6" "DIP2C-AS1" "TEAD1" "COL4A5" "CR1L" "SPINK13" "RP5-1119O21.2" "RP11-462B18.2"
[1411] "LGR4-AS1" "LINC02763" "ARLNC1" "AC005780.1" "MIR924HG" "RP3-483K16.4" "JDP2" "SPRY1" "CEACAM4" "SLC16A14" "CXCL3" "BUB3" "LINC01447" "FSCN1" "SLAMF7" "CCDC68" "MAP3K7CL" "DST" "CDKN1C" "CD82" "ALOX12" "RGS10" "IGHV2-70D" "IGLC6" "RBM20" "NFIB" "PIM1" "TLE1" "SLC11A1" "VPS9D1-AS1" "HMGN5" "CDK2" "TPM2" "SEPTIN11" "CCL18" "RAD54L" "WDR62" "CBR1" "DSG2" "CLIC2" "PLBD1" "FHL1" "HMGA2" "TRBV6-1" "CTA-992D9.11" "TMEM167A" "ZBP1"
[1458] "KIR3DX1" "CYP2C18" "SMIM43" "SAGE1" "GRID1" "AJAP1" "TDRG1" "TRBV11-1" "CATSPERZ" "NFE4" "AF015262.2" "FRGCA" "IGKV2D-30" "IGKV1D-33" "IGKV2D-24" "RP11-268P4.5" "IGHVIII-38-1" "BBOX1-AS1" "RP11-113K21.4" "SLC25A21-AS1" "CTC-453G23.8" "AC003002.6" "RP11-378A12.1" "RP11-434E6.4" "ERICH3" "ADTRP" "SPAG4" "SYNM" "SAMD7" "TRBV12-4" "CREM" "VIPR2" "PKIG" "MYO1D" "NFKBIZ" "ILK" "CSF3R" "CENPL" "HAVCR2" "KCNK6" "CYTOR" "C20orf27" "CTAG2" "N4BP3" "BAALC" "CTSB" "FOLR3"
[1505] "LEF1" "MPO" "IL7R" "TRAV22" "ACP5" "EGFL6" "EGR2" "STON2" "RP11-492E3.2" "CTDSPL" "FES" "BRIP1" "XRCC2" "SLC7A11" "LINC01980" "DNAJC12" "EGLN3" "METTL7A" "GAPT" "NXPH4" "MS4A7" "HACD3" "RECQL4" "HOMER3" "GADD45G" "AIM2" "S100A10" "SMCO4" "REG4" "LILRB2" "TMIGD2" "CREB5" "LILRA5" "PSMC3IP" "CYP2S1" "NRP2" "UNQ6494" "IL12A-AS1" "CFP" "CPPED1" "HOXA11" "CYP4B1" "ACSM3" "C21orf58" "NAMPT" "CDH1" "PFKP"
[1552] "MIR222HG" "DOK3" "LRRIQ1" "RP3-416J7.4" "ARX" "PREX2" "GPR12" "KLF15" "SLC16A9" "IFIT1B" "TRBV6-6" "MNX1-AS2" "LINC01936" "GXYLT1P5" "AC141930.3" "TLCD5" "ATP1B1" "CLEC12A" "PALD1" "GCSAML" "HMBS" "PEBP1" "PTPN3" "LIM2" "TYROBP" "POLR3G" "TRBC2" "WNT11" "FXYD7" "WHRN" "HERPUD1" "MT1F" "LY96" "KRT1" "MCEMP1" "LGALS3BP" "AP3M2" "GALM" "FLT3" "TRAV12-2" "HBQ1" "CD40LG" "ADA" "GADD45B" "GIMAP4" "GBP2" "FCGRT"
[1599] "TRABD2A" "PYCARD" "PLSCR1" "ESPL1" "ETV7" "SWAP70" "FGF18" "RP11-649E7.5" "CD28" "CSF2RB" "POU2F2" "TRAV6" "TCERG1L" "DHRS2" "LINC01857" "KLRF1" "GNG3" "DEPP1" "MXRA8" "TRAV5" "RBP7" "HEMGN" "CLDN14" "TRBV6-4" "TAFA5" "LINC00685" "TRGV3" "LRRN3" "TP73" "DDAH2" "TUBB6" "ITM2A" "TP63" "CD22" "IRAK3" "TWIST1" "SAP30" "CTTN" "CMC1" "DENND2D" "IRF7" "GMPR" "ABCA1" "SLC4A10" "ST14" "CXCL2" "KYNU"
[1646] "DBN1" "ZNF571" "DDX3Y" "SLC16A1-AS1" "SELENBP1" "SQOR" "PACSIN1" "PGRMC1" "RAB30" "PID1" "RCAN3" "BHLHE40" "IL5" "MDH1B" "TRPC6" "CGN" "PRSS35" "SLC30A8" "GAB4" "LINC01933" "RP11-214K3.24" "LA16c-306E5.2" "UAP1" "SASH1" "IL6" "SFRP5" "NRIP1" "IL4I1" "RP11-373D23.3" "BATF3" "NRP1" "LPP-AS1" "FERMT1" "SH3BP4" "LINC00891" "RNASEH2A" "ETFA" "NET1" "SLC9A3R1" "ADAM28" "IL1R2" "CYP1B1" "CORO1C" "RP11-106M7.1" "TDRD15" "NLRP3" "GBP4"
[1693] "ST6GALNAC1" "TMEM176A" "WEE1" "LINC02865" "PROK2" "AFF3" "RAB31" "OSM" "TNFRSF13C" "GSPT1" "C16orf74" "ZC3H12A" "ZNF80" "LILRB3" "GINS1" "SCARB2" "OTOF" "HSBP1" "DAB2" "H2BC15" "TMCC2" "LINC02195" "TRBV24-1" "EDA" "BMP2" "SAT1" "FCGR3A" "TEDC2" "HRH2" "SLC25A37" "MARCKSL1" "FAM184A" "FAM83D" "FGL2" "KRT7" "PRR15" "TRBV6-5" "RP11-143I21.1" "DPP4" "RP11-108L7.15" "OAS2" "MAGEB2" "IGKV6-21" "TEX49" "MRC2" "SELP" "PNP"
[1740] "IFI35" "TENT5C" "ANK1" "TTC39C-AS1" "KCTD14" "CNTNAP2" "NR4A3" "CASP3" "CR1" "LL22NC03-2H8.5" "KRT18" "ARHGAP6" "PIM3" "ENC1" "PLOD2" "NOP58" "IL13RA1" "ZP1" "PYGL" "TRAV17" "DMXL2" "LIMS2" "CPM" "IGSF11" "METTL7B" "FAM133A" "C10orf67" "RP11-474I16.8" "AC005616.1" "RP3-333B15.5" "TRBV15" "SEPTIN3" "F2-ENSG00000180210" "RP11-104J23.1" "RP11-214O1.3" "FGFBP2" "C12orf45" "PLK2" "POU3F1" "RRAD" "PDZK1IP1" "KLF10" "WDFY4" "PECAM1" "C12orf75" "DAPP1" "CITED2"
[1787] "TMEM109" "S100A4" "LINC02397" "FXYD1" "HBEGF" "IGHV3-64" "NEAT1" "MECOM" "RP11-693J15.6" "ADAMTS2" "ASGR1" "PARD6G" "IGFBP2" "PHLDA1" "H2AW" "MARCHF1" "GM2A" "PAGE5" "JAML" "KLF4" "GTSCR1" "CD163" "RP6-91H8.3" "HERC5" "TRAV25" "TRAV13-1" "SYNGR1" "LYN" "CD93" "TFRC" "ANKRD9" "SYNJ2" "GLB1L3" "LY6E" "TRAV1-2" "ARG2" "GALR3" "CREG1" "PTX3" "COL5A3" "CTSA" "SMIM3" "HNMT" "PPP1R9A" "RASD2" "OPRM1" "PRSS33"
[1834] "BICDL1" "HSPB1" "PHLDA2" "SORBS2" "DYSF" "RAB11A" "SCNN1B" "RPRM" "SPDYC" "BACE2" "C5AR1" "GCA" "FFAR3" "ARC" "TRBV5-6" "RP11-342K6.4" "TRBV25-1" "LINC00926" "TCEAL2" "TIMP1" "TRAV9-2" "GATA3" "FGF14" "CLEC7A" "ETS2" "TRMT6" "EDN3" "USP18" "APOBEC3G" "CD302" "GZMH" "STRBP" "LGALS9" "GAS2L1" "SFXN1" "CISH" "RP11-605B16.1" "SLC7A7" "FKBP5" "TUBA4A" "ADGRE2" "XK" "F5-ENSG00000198734" "IRS2" "PITPNM2-AS1" "HSD11B1" "TEX48"
[1881] "SH2D1B" "RBPMS2" "HLA-DPB1" "IL2RB" "GLA" "GPR155" "TICRR" "DNAJA4" "PDE4B" "PROS1" "SMIM14" "C19orf38" "CD5" "PTGS1" "ALCAM" "TRAT1" "CCDC80" "CREG2" "NIBAN3" "CYP4F3" "NEIL1" "SERTAD1" "PAK1" "OSBPL10" "EBF1" "CDA" "CCR2" "TMEM70" "GBP5" "TUBB2A" "MIR155HG" "TLCD4" "LDLR" "LINC01934" "SAMSN1" "MAP3K20" "TOB1" "TRIM58" "PLK3" "FCGR1A" "C1orf162" "MIR223HG" "FRMD4B" "HDAC9" "LGALS14" "AUTS2" "RAB27B"
[1928] "GPR171" "SKIL" "MID1IP1" "SAMD9L" "DEFA3" "SLC43A2" "PMEPA1" "SOCS1" "IL3RA"
We can see how by recomputing the HVG we replace 33% of the HVG genes which capture the variability present in the T cell subset.
It looks like UMAP_1 is clearly separating 2 populations… let’s look at some QC metrics to see what might be going on by looking at the predicted.id
DimPlot(
tse_sub,
dims = c(1, 2),
reduction = "pca",
group.by = "predicted.id",
label = TRUE)It appears that we have some myeloid and B cells within our T cell subset…. Let’s check the PCA loadings
# Examine and visualize PCA results a few different ways
print(tse_sub[["pca"]], dims = 1:10, nfeatures = 15)PC_ 1
Positive: TYMS, PCLAF, MKI67, BIRC5, ACTB, STMN1, RRM2, TK1, CDT1, ZWINT
MYBL2, PFN1, CDK1, PSME2, GAPDH
Negative: IL7R, CCR7, LTB, RCAN3, MAL, JUNB, RGCC, TRABD2A, ZNF331, TSHZ2
LEF1, CD69, SESN3, MYADM, RPS18
PC_ 2
Positive: LTB, IL7R, RCAN3, RPS2, RPLP0, RPS18, CCR7, RPS5, MAL, LDHB
GPR183, LEF1, COTL1, CD28, TYMS
Negative: GZMB, PRF1, FGFBP2, FCGR3A, GNLY, GZMH, GZMA, TYROBP, SPON2, CLIC3
KLRF1, CCL4, FCER1G, TRDC, PLEK
PC_ 3
Positive: TYROBP, MAP3K8, B3GNT7, MAFF, BHLHE40, IGFBP7, KLRF1, KLRB1, IER2, IER5
TAMALIN, FCER1G, SPON2, AREG, TYMS
Negative: TRAC, IL32, CD52, S100A8, VIM, S100A9, S100A11, LDHB, RPS4Y1, TMSB10
RPLP1, COTL1, LTB, RPS2, HBB
PC_ 4
Positive: ZFP36, JUNB, CD69, DUSP2, JUN, NR4A2, DUSP1, IER2, NFKBIA, FOS
RGS1, CREM, IER5, PPP1R15A, ZNF331
Negative: CX3CR1, MT-ATP6, FCGR3A, KLRC2, PLEK, SYNGR1, TRDC, FGFBP2, TMPO, MTRNR2L12
RRM2, ASCL2, UBE2C, MYBL2, H1-3
PC_ 5
Positive: IL32, TMSB4X, CD8A, S100A4, CD52, HLA-DRB1, GZMH, HLA-DPB1, CD8B, GZMA
TRGC2, S100A10, HLA-DPA1, DUSP2, TRBC2
Negative: MZB1, JCHAIN, TNFRSF17, IGKC, ITM2C, POU2AF1, DERL3, IGHG4, IGHG1, IGHG3
TNFRSF13B, COBLL1, PNOC, IGLC2, IGHA1
PC_ 6
Positive: HLA-DRB1, MZB1, JCHAIN, HERPUD1, CD8A, HLA-DPB1, HLA-DPA1, TNFRSF17, GZMK, TRGC2
CD8B, HLA-DRA, HLA-DQB1, DUSP2, CD74
Negative: FCER1G, NRGN, PPBP, CAVIN2, IGFBP7, PF4, MAL, CCR7, TUBB1, TMIGD2
ACTN1, LTB, BEX3, RPS4Y1, GNG11
PC_ 7
Positive: RPL37A, RPS18, RPLP1, RPS5, LTB, RPS2, IGFBP7, NPM1, MAL, C1orf162
PTMA, KLRF1, RPLP0, RCAN3, TMIGD2
Negative: NRGN, PPBP, CAVIN2, PF4, GNG11, TUBB1, S100A9, S100A8, SPARC, TREML1
GP9, CLU, ACRBP, RGS18, CD8A
PC_ 8
Positive: S100A8, S100A9, LYZ, S100A12, VCAN, HBB, MNDA, FOS, GBP1, NFKBIA
STAT1, CXCL8, TYMP, TMEM176B, DUSP1
Negative: PF4, CAVIN2, PPBP, NRGN, TUBB1, GNG11, GP9, TREML1, ACRBP, CMTM5
SPARC, MYL9, PRKAR2B, CLU, TMEM40
PC_ 9
Positive: IFI44L, XAF1, ISG15, MTRNR2L12, NEAT1, DUSP4, MX1, IFI16, CREM, EPSTI1
ZNF331, CAPZA1, RGS1, AUTS2, IRF4
Negative: MYADM, FTL, TAGLN2, RPS5, RPS2, DUSP1, MT-CO3, FOS, MT-CYB, LMNA
FOSB, MT-CO2, MIR23AHG, PPP1R15A, RGCC
PC_ 10
Positive: MTRNR2L12, TMSB4X, S100A4, CRIP1, IFI44L, ITGB1, IL32, S100A10, LMNA, XAF1
S100A11, DLGAP5, LGALS1, CENPA, UBE2C
Negative: MT-CO3, AIF1, MT-CO2, GINS2, FABP5, CD8A, CD8B, E2F1, CDCA7L, MCM2
SOX4, HMGA1, MCM5, CDT1, MCM4
Some interesting PCs are the following:
- PC1 contains a lot of proliferation related genes
- PC5 contains a lot of B cell related genes in the negative loadings
- PC8 contains a lot of myeloid genes with high loadings
DimPlot(
tse_sub,
dims = c(1, 5),
reduction = "pca",
group.by = "predicted.id",
label = TRUE) | DimPlot(
tse_sub,
dims = c(1, 8),
reduction = "pca",
group.by = "predicted.id",
label = TRUE)Let’s check some QC metrics to assess if we have potential doublets
VlnPlot(
tse_sub,
features = c("pANN", "nCount_RNA", "nFeature_RNA"),
group.by = "RNA_snn_res.0.15",
pt.size = 0
)Clusters 6 & 7 have elevated pANN scores while cluster 3 has slightly elevated library complexity. Let’s check some cell type genes to determine what might be going on:
ct_genes <- c(
# Proliferating genes
"MKI67", "TOP2A", "STMN1",
# B cell genes
"CD79A", "CD79B", "MS4A1",
# Myeloid genes
"S100A8", "S100A9", "CD14",
"FCGR3A", "LYZ", "VCAN",
# T cell genes
"CD3D", "CD3E", "CD8B",
"TRAC", "TRBC1", "TRDC",
# NK genes
"NCR1", "NCR2", "NCR3",
"KLRF1", "KLRC2"
)
Seurat::DotPlot(
object = tse_sub,
features = ct_genes,
group.by = "RNA_snn_res.0.15",
col.min = 0,
dot.min = 0) +
ggplot2::scale_x_discrete(
breaks = ct_genes) +
ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 60, hjust = 1)) +
ggplot2::labs(x = "", y = "")Clusters 6 and 7 seem to be proliferating cells. 6 appeard to have a mix of T and myeloid cells and 7 are B cells. Moreover, cluster 3 is simultaneously expressing Myeloid and T cell genes which could be an indication that these are doublets. Let’s double check these cells are in fact doublets:
# Scale gene expression of genes of interest
tse_sub <- ScaleData(tse_sub, features = ct_genes)
DoHeatmap(
tse_sub,
features = ct_genes,
group.by = "RNA_snn_res.0.15")Effectively, cluster 3 are doublets!
T cell clean up
The whole purpose of iterative subclustering is to capture the biological variability of a specific cell type. Clearly, our level 1 annotation of T cells included more than just T cells, for the prupose of this notebook let’s simulate a level-3 annotation where we removed clusters not of interest and focus on T cells!
# Keep only clusters 0, 1, 2, 4, 5
tse_sub <- tse_sub[, tse_sub$RNA_snn_res.0.15 %in% c(0, 1, 2, 4, 5)]Reprocess it -
tse_sub <- tse_sub %>%
FindVariableFeatures(
method = "vst",
nfeatures = 3000,
verbose = FALSE) %>%
ScaleData(verbose = FALSE, features = VariableFeatures(.)) %>%
RunPCA(
npcs = 50,
verbose = FALSE
)
# Save HVG for T cells only
hvg_t <- VariableFeatures(tse_sub)
# Look at the elbow plot
ElbowPlot(tse_sub, ndims = 50)tse_sub <- tse_sub %>%
FindNeighbors(dims = 1:30, verbose = FALSE) %>%
FindClusters(resolution = c(0.15, 0.2, 0.25, 0.3), verbose = FALSE) %>%
RunUMAP(dims = 1:30, verbose = FALSE)Let’s take a look at these clusters
DimPlot(
tse_sub,
reduction = 'umap',
group.by = c(
"RNA_snn_res.0.15", "RNA_snn_res.0.2",
"RNA_snn_res.0.25", "RNA_snn_res.0.3"))dim_sub <- DimPlot(
tse_sub,
reduction = 'umap',
group.by = c("RNA_snn_res.0.15"))As showcased in the clustering algorithm a data-driven way to guide the decision making of the right clustering would be to run the Silhouette Analysis. For the purpose of this notebook we are going to move forward with 0.15.
dim_full + dim_subLet’s compare the clusters with those obtained with the original full HVG genes
temp_df <- tse_sub@meta.data %>%
rownames_to_column("bc") %>%
dplyr::select(bc, RNA_snn_res.0.15) %>%
dplyr::rename(sub_hvg = RNA_snn_res.0.15)
data_alluvial2 <- tse_full@meta.data %>%
rownames_to_column("bc") %>%
dplyr::rename(full_hvg = RNA_snn_res.0.15) %>%
left_join(temp_df, by = "bc") %>%
mutate(sub_hvg = if_else(is.na(sub_hvg), "removed", sub_hvg)) %>%
dplyr::count(full_hvg, sub_hvg)
ggplot(data = data_alluvial2,
aes(axis1 = full_hvg, axis2 = sub_hvg, y = n)) +
geom_alluvium(aes(fill = sub_hvg), width = 0.1) +
geom_stratum(width = 0.1) +
geom_text(stat = "stratum", aes(label = after_stat(stratum))) +
theme_minimal() +
scale_fill_brewer(palette = "Dark2") labs(title = "Alluvial plot of Clustering using full data HVG or subcluster HVG",
x = "Cluster assignment",
y = "N")$x
[1] "Cluster assignment"
$y
[1] "N"
$title
[1] "Alluvial plot of Clustering using full data HVG or subcluster HVG"
attr(,"class")
[1] "labels"
Annotation playground
Let’s find the marker genes of this new clustering
Idents(tse_sub) <- tse_sub$RNA_snn_res.0.15
mgs <- FindAllMarkers(
tse_sub,
logfc.threshold = 0.25,
only.pos = TRUE,
min.pct = 0.25)
DT::datatable(mgs, filter = "top")Do you think you can annotate these new clusters….
Cluster 1
Let’s look at genes that are differentially expressed
c1 <- c(
"KLRF1", "FCER1G", "FCGR3A",
"TRDC", "GZMB", "GNLY")
FeaturePlot(
tse_sub,
features = c(c1),
ncol = 3) +
dim_subVlnPlot(
tse_sub,
features = c(c1),
group.by = "RNA_snn_res.0.15") +
dim_subCluster 4
c4 <- c(
"KLRF1", "FCER1G", "FCGR3A",
"TIGIT", "CX3CR1", "KLRC2")
FeaturePlot(
tse_sub,
features = c(c4),
ncol = 3) +
dim_subVlnPlot(
tse_sub,
features = c(c4),
group.by = "RNA_snn_res.0.15",
pt.size = 0) +
dim_subWe can see how TIGIT is a gene uniquely expressed in cluster 4 which has come off of 0. How would this look like in the embedding with the full dataset
FeaturePlot(
tse_full,
features = "TIGIT")Cluster 7
This cluster seems to be expressin Hemoglobin and MHC2 genes….
c7 <- c(
"KLRF1", "FCER1G", "FCGR3A",
"HBB", "HBA1", "HBA2",
"HLA-DQA1", "HLA-DQA2")
FeaturePlot(
tse_sub,
features = c(c7),
ncol = 3) +
dim_subVlnPlot(
tse_sub,
features = c(c7),
group.by = "RNA_snn_res.0.15",
pt.size = 0) +
dim_subAnnotate
tse_sub@meta.data <- tse_sub@meta.data %>%
dplyr::mutate(
annotation_lvl2 = dplyr::case_when(
RNA_snn_res.0.15 == 0 ~ "NK",
RNA_snn_res.0.15 == 1 ~ "",
RNA_snn_res.0.15 == 2 ~ "",
RNA_snn_res.0.15 == 3 ~ "",
RNA_snn_res.0.15 == 4 ~ "NK TIGIT+/CX3CR1+/KLRC2+",
RNA_snn_res.0.15 == 5 ~ "",
RNA_snn_res.0.15 == 6 ~ "",
RNA_snn_res.0.15 == 7 ~ "Doublets/lowQ",
RNA_snn_res.0.15 == 8 ~ ""
)
)
DimPlot(tse_sub, group.by = "annotation_lvl2")Comparing PCA
Lastly, lets compare the PC spaces between using HVG selected from the whole data or from the cleaned up T cell subset
So let’s take a closer look at what changes in the PCA latent spaces. Maybe the information we need is there.
latent_full <- tse_full@reductions$pca@cell.embeddings
loadings_full <- tse_full@reductions$pca@feature.loadings
var_full <- tse_full@reductions$pca@stdev
t_full <- tse_full@reductions$pca@misc$total.variance
latent_sub <- tse_sub@reductions$pca@cell.embeddings
loadings_sub <- tse_sub@reductions$pca@feature.loadings
var_sub <- tse_sub@reductions$pca@stdev
t_sub <- tse_sub@reductions$pca@misc$total.variance
var_full_df <- data.frame(
PC = 1:length(var_full),
Variance = var_full^2 / t_full,
Set = "Full HVG")
var_sub_df <- data.frame(
PC = 1:length(var_sub),
Variance = var_sub^2 / t_sub,
Set = "Clean HVG")
# Calculate cumulative variance explained
var_full_df$CumulativeVariance <- cumsum(var_full_df$Variance)
var_sub_df$CumulativeVariance <- cumsum(var_sub_df$Variance)
# Combine the datasets
combined_variance <- bind_rows(var_full_df, var_sub_df)
# Plotting the variance explained and cumulative variance
ggplot(combined_variance, aes(x = PC)) +
geom_line(aes(y = CumulativeVariance, color = Set), size = 1.2) +
geom_point(aes(y = CumulativeVariance, color = Set), size = 2) +
scale_color_manual(values = c("blue", "red")) +
scale_linetype_manual(values = c("solid", "dashed")) +
theme_minimal() +
labs(title = "Cumulative Variance Explained",
x = "Principal Component",
y = "Proportion of Variance Explained") +
guides(color = guide_legend(title = "Parameter Set"),
linetype = guide_legend(title = "Variance Type"))Both seem to be very similar, and it makes sense, since they are gene sets obtained by selecting the most variable genes from their respective spaces. However, the biological variance they capture is very different! We can look at that by comparing how the PC loadings of genes of interest change.
# Function to compare PC loadings between 2 embeddings, it basically compares the difference between the max value of each gene between both PC spaces
compare_pca_loadings <- function(loadings1, loadings2, top_n = 20, cell_type_markers = NULL,
loadings1_name = deparse(substitute(loadings1)),
loadings2_name = deparse(substitute(loadings2))) {
all_genes <- union(rownames(loadings1), rownames(loadings2))
# Initialize matrices to store aligned loadings with genes set to 0 by default
loadings1_aligned <- matrix(0, nrow = length(all_genes), ncol = ncol(loadings1),
dimnames = list(all_genes, colnames(loadings1)))
loadings2_aligned <- matrix(0, nrow = length(all_genes), ncol = ncol(loadings2),
dimnames = list(all_genes, colnames(loadings2)))
# Fill in the existing values from each matrix
loadings1_aligned[rownames(loadings1), ] <- loadings1
loadings2_aligned[rownames(loadings2), ] <- loadings2
# Check for zero values immediately after filling in values
loadings1_zero <- apply(loadings1_aligned, 1, function(x) all(x == 0))
loadings2_zero <- apply(loadings2_aligned, 1, function(x) all(x == 0))
outlines <- character(length(all_genes))
outlines[loadings1_zero] <- "orange"
outlines[loadings2_zero] <- "blue"
# outlines[loadings1_zero & loadings2_zero] <- "white" # Both are zero
max_loadings1 <- apply(abs(loadings1_aligned), 1, max)
max_loadings2 <- apply(abs(loadings2_aligned), 1, max)
# Calculate the differences between the maximum absolute loadings across all PCs
loadings_difference <- max_loadings2 - max_loadings1
# Determine the direction of the difference
directionality <- ifelse(loadings_difference > 0, paste("Higher in", loadings2_name), paste("Higher in", loadings1_name))
fillpal <- c('steelblue','salmon')
names(fillpal) = c(paste("Higher in", loadings2_name), paste("Higher in", loadings1_name))
# Create a data frame for plotting
genes_differences_df <- data.frame(
Feature = rownames(loadings1_aligned), # Assuming both matrices have the same rownames
Difference = loadings_difference,
Direction = directionality,
Outline = outlines[match(rownames(loadings1_aligned), all_genes)]
)
genes_differences_df <- genes_differences_df %>% mutate(
MaxAbsLoadings1 = ifelse(Direction == names(fillpal)[1], -max_loadings1, max_loadings1),
MaxAbsLoadings2 = ifelse(Direction == names(fillpal)[2], -max_loadings2, max_loadings2))
if (!is.null(cell_type_markers)) {
marker_genes <- unlist(cell_type_markers, use.names = FALSE)
genes_differences_df <- genes_differences_df %>%
dplyr::filter(Feature %in% marker_genes) %>%
dplyr::mutate(CellType = NA) # Assign NA initially
for (cell_type in names(cell_type_markers)) {
genes_differences_df$CellType[genes_differences_df$Feature %in% cell_type_markers[[cell_type]]] <- cell_type
}
plot <- ggplot(
genes_differences_df,
aes(x = reorder(Feature, Difference), y = Difference, fill = Direction)) +
geom_col() +
coord_flip() +
theme_minimal() +
scale_fill_manual(values = fillpal) +
labs(title = "Gene Loadings Differences by Cell Type",
subtitle = paste("Comparing", loadings1_name, "and", loadings2_name),
x = "Gene",
y = "Difference in Loadings") +
guides(fill = guide_legend(title = "Where Loadings are Higher")) +
facet_wrap(~CellType, scales = "free_y", ncol = 1, drop = TRUE)
} else {
genes_differences_df <- genes_differences_df %>%
dplyr::arrange(desc(Difference)) %>%
dplyr::slice(1:top_n)
plot <- ggplot(
genes_differences_df,
aes(x = reorder(Feature, Difference), y = Difference, fill = Direction)) +
geom_col(show.legend = FALSE) +
scale_color_manual(
name = "Outline Color",
values = c("green" = "green", "blue" = "blue", "purple" = "purple"),
labels = c(paste("Zero in", loadings1_name),
paste("Zero in", loadings2_name),
"Zero in Both")) +
coord_flip() +
theme_minimal() +
scale_fill_manual(values = fillpal) +
labs(
title = paste("Top", top_n, "Genes with Greatest Differences in Loadings"),
subtitle = paste("Comparing", loadings1_name, "and", loadings2_name),
x = "Gene",
y = "Difference in Loadings") +
guides(fill = guide_legend(title = "Where Loadings are Higher"))
}
return(plot)
}Let’s compare which genes change the most between the full HVG and clean HVG
# Ensure both matrices have the same genes in the same order
common_genes <- intersect(rownames(loadings_full), rownames(loadings_sub))
loadings_full_aligned <- loadings_full[common_genes, , drop = FALSE]
loadings_sub_aligned <- loadings_sub[common_genes, , drop = FALSE]
# Determine high loadings
cutoff_l <- quantile(abs(as.matrix(loadings_full_aligned)), 0.95)
cutoff_h <- quantile(abs(as.matrix(loadings_sub_aligned)), 0.95)
high_loadings_full <- apply(abs(loadings_full_aligned), 1, max) > cutoff_l
high_loadings_sub <- apply(abs(loadings_sub_aligned), 1, max) > cutoff_h
# Identify features that are high in _h but not high in _l
target_features <- names(which(high_loadings_sub & !high_loadings_full))
# Prepare loadings of these target features for plotting
target_loadings <- loadings_sub_aligned[target_features, , drop = FALSE]
# Convert to long format for ggplot using pivot_longer
loadings_fullong <- as.data.frame(target_loadings) %>%
tibble::rownames_to_column("Feature") %>%
pivot_longer(
cols = -Feature,
names_to = "PC",
values_to = "Loading"
)
# Example usage:
compare_pca_loadings(loadings_full, loadings_sub, top_n = 40)Many of these genes are trained immunity genes. At this point we could keep some rare positive-control celltype in mind, like ILC3s or something else we might expect to find in this dataset
# Define a list of cell type markers
cell_type_markers <- list(
Tcell = c("TRAC", "TRBC1", "TRBC2", "CD3D", "CD3E"),
CD8cyto = c("CD8A", "CD8B", "GZMA", "NKG7", "GZMK"),
CD4 = c("CD4"),
Trespone = c("CD40LG", "CD28"),
Naive = c("CCR7", "LEF1", "TCF7"),
Treg = c("FOXP3", "CTLA4", "IL2RA"),
NK = c("NCR1", "NCR2", "NCR3", "KLRF1", "KLRC2"),
Myeloid = c("S100A8", "S100A9", "CD14", "FCGR3A", "LYZ", "VCAN"),
Bcell = c("CD79A", "CD79B", "MS4A1")
)
compare_pca_loadings(loadings_full, loadings_sub, cell_type_markers = cell_type_markers)And if we compare the loadings for genes of interest
compare_pca_loadings(loadings_full, loadings_sub, cell_type_markers = cell_type_markers)Session Info
sessionInfo()R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.4
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/New_York
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ggalluvial_0.12.5 tictoc_1.2.1 RColorBrewer_1.1-3 scales_1.3.0 DT_0.33 colorBlindness_0.1.9 devtools_2.4.5 usethis_2.2.2 lubridate_1.9.2 forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4 purrr_1.0.2 readr_2.1.4 tidyr_1.3.1 tibble_3.2.1 ggplot2_3.5.0 tidyverse_2.0.0 Seurat_5.1.0 SeuratObject_5.0.2 sp_2.1-3
loaded via a namespace (and not attached):
[1] RcppAnnoy_0.0.22 splines_4.3.1 later_1.3.2 polyclip_1.10-6 fastDummies_1.7.3 lifecycle_1.0.4 globals_0.16.2 processx_3.8.2 lattice_0.21-8 MASS_7.3-60 crosstalk_1.2.1 magrittr_2.0.3 limma_3.56.2 plotly_4.10.4 sass_0.4.8 rmarkdown_2.26 jquerylib_0.1.4 yaml_2.3.8 remotes_2.5.0 httpuv_1.6.14 sctransform_0.4.1 spam_2.10-0 sessioninfo_1.2.2 pkgbuild_1.4.2 spatstat.sparse_3.0-3 reticulate_1.35.0.9000 cowplot_1.1.3 pbapply_1.7-2 abind_1.4-5 pkgload_1.3.2.1 Rtsne_0.17 presto_1.0.0 ggrepel_0.9.5 irlba_2.3.5.1 listenv_0.9.1 spatstat.utils_3.0-4 goftest_1.2-3 RSpectra_0.16-1 spatstat.random_3.2-2 fitdistrplus_1.1-11 parallelly_1.37.0 leiden_0.4.3.1 codetools_0.2-19 tidyselect_1.2.0 farver_2.1.1 matrixStats_1.2.0 spatstat.explore_3.2-6 jsonlite_1.8.8 ellipsis_0.3.2 progressr_0.14.0 ggridges_0.5.6
[52] survival_3.5-7 tools_4.3.1 ica_1.0-3 Rcpp_1.0.12 glue_1.8.0 gridExtra_2.3 xfun_0.42 withr_3.0.0 fastmap_1.1.1 fansi_1.0.6 callr_3.7.3 digest_0.6.34 timechange_0.2.0 R6_2.5.1 mime_0.12 gridGraphics_0.5-1 colorspace_2.1-0 scattermore_1.2 tensor_1.5 spatstat.data_3.0-4 utf8_1.2.4 generics_0.1.3 data.table_1.15.4 prettyunits_1.1.1 httr_1.4.7 htmlwidgets_1.6.4 uwot_0.1.16 pkgconfig_2.0.3 gtable_0.3.4 lmtest_0.9-40 htmltools_0.5.7 profvis_0.3.8 dotCall64_1.1-1 png_0.1-8 knitr_1.45 rstudioapi_0.16.0 tzdb_0.4.0 reshape2_1.4.4 nlme_3.1-163 cachem_1.0.8 zoo_1.8-12 KernSmooth_2.23-22 parallel_4.3.1 miniUI_0.1.1.1 vipor_0.4.5 ggrastr_1.0.2 pillar_1.9.0 grid_4.3.1 vctrs_0.6.5 RANN_2.6.1 urlchecker_1.0.1
[103] promises_1.2.1 xtable_1.8-4 cluster_2.1.4 beeswarm_0.4.0 evaluate_0.23 cli_3.6.3 compiler_4.3.1 rlang_1.1.3 crayon_1.5.2 future.apply_1.11.1 labeling_0.4.3 ps_1.7.5 plyr_1.8.9 fs_1.6.3 ggbeeswarm_0.7.2 stringi_1.8.3 viridisLite_0.4.2 deldir_2.0-2 munsell_0.5.0 lazyeval_0.2.2 spatstat.geom_3.2-8 Matrix_1.6-5 RcppHNSW_0.6.0 hms_1.1.3 patchwork_1.2.0 future_1.33.1 shiny_1.8.0 ROCR_1.0-11 igraph_2.0.2 memoise_2.0.1 bslib_0.6.1